Introduction

Notebook Objective

This notebook documents the data engineering required to process 6 months (Jan 2021-Jun 2021) of SMRT LVDT datasets into usable data for the end goal of building machine learning models to:

(1) Classify cases of potentially degrading sections of 3R in the NSL into 4 categories depending on urgency - e.g. SI1 - SI4
(2) Predict the time taken for a section of 3R identified to be problematic to degrade into a specific state - e.g. 160mm 3R Vertical Gauge

Glossary of Terms
Term/Abbreviation Definition/Meaning
3R The third rail, a power rail running adjacent to the two running rails on which the train rolls
CCD Shoe The Current Collector Device shoe is a sheet of carbon attached to the CCD assembly to contact the 3R
Chainage A referencing system for identifying locations on the track by distance; each unit of chainage corresponds to 1m
Contact Range The chainage range where the 3R and CCD Shoe are in contact
Floating Range The chainage range where the 3R and CCD Shoes are not in contact and the CCD shoe is 'floating' in the air
LVDT A linear variable differential transformer, a sensor for measuring the height of the 3R and converting it to an electrical signal
MA Moving Average
Ramp Contact Range The chainage range where the CCD Shoes are in contact with the ramp
ROC Rate of Change

Note: Chainage is not necessarily in running order; for example, chainage can jump from 1000 to 2000 with no physical location corresponding to chainages between 1000-2000. Additionally, a chainage can refer to two locations at once if not tagged with the sector it belongs to (e.g. Khatib-Yio Chu Kang could have a chainage 500 in its range and this chainage 500 could also exist in Admiralty-Woodlands).

Overview of Steps
  1. Extract data from all LVDT files
  2. Understand the characteristics of the LVDT data
  3. Isolate the LVDT data into the three categories: 'contact', 'ramp' and 'floating'
  4. Normalize each data set (by bound, left/right, date, train) on a specific value so that they can be compared
  5. Compare data sets across time to get rate of change of measurements over time

Combining the relevant CSV files into a Pickle

This is done so that we do not need to iterate through all the csv files to concatenate the data in the future
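The combining step can be sketched as below; the 'lvdt_csv' folder name is a hypothetical stand-in for wherever the monthly CSV exports actually live, and the CSVs are assumed to share one schema.

```python
import glob
import pandas as pd

def build_lvdt_pickle(csv_dir="lvdt_csv", out_path="lvdtdata.pkl"):
    # read every CSV in the folder and stack them into one DataFrame
    frames = [pd.read_csv(p) for p in sorted(glob.glob(f"{csv_dir}/*.csv"))]
    df = pd.concat(frames, ignore_index=True)
    # persist so later runs can skip the CSV iteration entirely
    df.to_pickle(out_path)
    return df
```

Later sessions then only need `pd.read_pickle("lvdtdata.pkl")` to recover the combined data.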

Excel%20Files.PNG

The cell below can be skipped if the 'lvdtdata.pkl' file is available for use

Some quick improvements identified from browsing the data: (1) the dates are not in date format, (2) the decimal values after 3 d.p. are unnecessary
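A minimal sketch of the two clean-ups, assuming a 'Date' column of day-first strings and a '3R_Left_Vert' measurement column (the column names and date format are assumptions):

```python
import pandas as pd

def quick_clean(df):
    df = df.copy()
    # (1) parse the date strings into a proper datetime dtype
    df["Date"] = pd.to_datetime(df["Date"], dayfirst=True, errors="coerce")
    # (2) truncate the measurement to 3 decimal places
    df["3R_Left_Vert"] = df["3R_Left_Vert"].round(3)
    return df
```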

Find relationships within/between variables EMU Number, Bound, Date

This step is done to understand the general characteristics of the data. pandas_profiling is good for an initial general answer to some of these questions
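As a lightweight stand-in for the profiling pass, a grouped summary already answers some of the within/between-variable questions; the column names ('EMU', 'Bound', 'Date', '3R_Left_Vert') are assumptions:

```python
import pandas as pd

def group_summary(df):
    # per EMU/bound/date statistics of the vertical gauge, to spot
    # differences between sensors, trains and runs at a glance
    return (df.groupby(["EMU", "Bound", "Date"])["3R_Left_Vert"]
              .agg(["count", "mean", "std", "min", "max"]))
```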

Some quick insights identified: (1) 3R Vertical Gauge looks normally distributed, (2) based on the Vert_Gauge histograms, >215 looks like the level where the LVDT hovers when not contacting the 3R or ramps, (3) the symmetry in the 180-210 range is likely the region where the LVDT is measuring the 3R ramp vertical gauge

3R%20Vertical%20Gauge%20Histogram.PNG

Sensor performance check

Check if the sensors on the train (left and right) behave in the same way
Check if sensors on different trains behave in the same way

First off, the good news is that the vertical gauge readings in the range corresponding with contact between the CCD shoe and the 3R are generally normally distributed. This is important because it is expected that should the population of maintenance staff all attempt to gauge the 3R to the same value (170mm - the nominal value for 3R vertical gauge), the deviation from the target should indeed follow such a distribution.

Note: There is a slight but noticeable bias towards higher values for the measurements in the 'contacting range' (where the CCD shoe is touching the 3R) - this bias is either due to the sensor, or the actual physical state of the 3R (i.e. staff are intentionally fixing the 3R gauge slightly above 170mm.)

Statistically, having a normal distribution opens up the doors to many statistical analysis methods and tools for comparing and manipulating the distributions.

However, the vertical gauge in the 'floating range' (where the CCD shoe is not touching the 3R) seems to be made up of multiple distributions - possibly due to the different resting positions of each sensor and the fluctuations it experiences while the train rumbles along the track. As an example of this phenomenon, EMU501's and EMU533's left sensors each seem to have one main rest spot around which the sensor readings fluctuate, hence the obvious single peak in both cases, while EMU501's right sensor has 3 peaks, potentially indicating 3 different resting spots of the CCD shoe.



Note: SMRT's convention of left and right is defined by the direction of increasing chainage, hence on both the NB and SB, left will refer to the same side and thus the same sensor

Hardware Hypotheses As suspected, the different EMUs have slight but noticeable differences in distribution shape for the measured vertical gauge. This could be due to sensor or installation differences, or a change in the state of the EMU (e.g. wear or change of CCD shoe at some point in time, different voltage regulation, different mechanical resistance of the CCD assembly, or route travelled, e.g. more NB than SB) that resulted in variation in the data collected. It could also be that the calibration of the sensors was done differently.
The ideal case would have been identical distribution shapes shared by all the Left Verts (since they measure the same 3R on both bounds) and another distribution shared by all the Right Verts.
Software Hypotheses In terms of software, it could also be that the sensor data logging and processing system may be artificially re-classifying data into certain values. This may have implications further down the analysis when vertical gauge comparisons across days and between EMUs are to be done (e.g. one EMU may have a higher mean than another, suggesting an offset is required for fair comparison).

Difference in readings between bounds

Let us test the bound difference hypothesis by further splitting the data by bound and zooming into the 3R gauge

Clearly the bounds have different distributions of Left 3R and Right 3R (it seems most of the 3R on the NB is on the right and most of the 3R on the SB is on the left). It makes sense that the Left and Right cases have an inverse relationship, given that if there is 3R on the Left, no 3R is required on the Right.

Also noticeable is the different skews of floating positions of the LVDT of the Left and Right sensors for each train and the generally similar skews when the Left or Right sensors are on either bound

The LVDT 3R reading contact range (150-190) generally has the shape of a Normal Distribution.

For the floating range (200-230), sometimes a clear double peak can be seen regardless of whether it is a majority/minority 3R on Left/Right situation.

For the ramp contact range, it can be understood that the range lies between the contact range and the floating range, merging into both distributions due to noise and the fact that, physically, the ramps are a transition from the normal 3R range to the floating range. One thing that is certain is that the ramp to conductor-rail length fraction is much smaller than 1, and the ramp has an almost constant gradient, meaning we expect a roughly uniform distribution for the measurements in the ramp contact range as samples are taken along the ramp at different heights. This also means that the total area of the conductor-rail region is larger than the area of the ramp (transition length) region - the accuracy of which is contingent on the tachometer of the EMU (the LVDT has a constant distance-based sampling rate and does not give more LVDT readings/m at the ramp).

Note: The sampling rate of the LVDT is approximately 10 readings/m, meaning that the along-track resolution of the 3R gauge is around 10cm.

Using rate of change of 3R gauge (across distance) to isolate ramps chainage

Attempt to use rate of change to find chainages that belong to ramps under the assumption that the rate of change of 3R gauge across distance at ramps is greater than elsewhere (since ramps are a slanted and curved piece of 3R).

It is apparent that there are ROC spikes spaced out across the chainages; these could possibly correspond with ramp chainages.

Visually, 2-3mm/m looks like a good value to set a cutoff point between ROC at conductor-rail and ROC at ramps. To check this assumption, we take the quantile that corresponds to the 3mm ROC, which turns out to be the 99.5th percentile, meaning that for every 100m of distance, roughly 0.5m is ramp. An HSR on the mainline is roughly 5m long, hence we expect to see one ramp for every 1km of 3R. This is plausible considering we are only analyzing the 3R on one side of the train, meaning that when both the left and right sides are considered, the actual ramp frequency on a bound is 1 ramp/500m, giving a total of 400 ramps in the entire network.
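A sketch of the ROC-plus-quantile idea, assuming 'Chainage1' and '3R_Left_Vert' columns; the 99.5th percentile default mirrors the value discussed here:

```python
import numpy as np
import pandas as pd

def roc_cutoff(df, q=0.995):
    df = df.sort_values("Chainage1").copy()
    # absolute change in vertical gauge per metre of chainage
    df["ROC"] = (df["3R_Left_Vert"].diff().abs()
                 / df["Chainage1"].diff().replace(0, np.nan))
    # pick the cutoff as a high quantile of the ROC distribution
    cutoff = df["ROC"].quantile(q)
    df["is_ramp_candidate"] = df["ROC"] > cutoff
    return df, cutoff
```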

As a worked example, at 4 ramps per km the cut-off point is 1.5799, which gives 800 ramps in 200km. However, remember that this rests on the assumption that ROC has a threshold that can differentiate 3R, ramps and float.

Additionally, it will be best to find out the actual number of ramps in the line and bring the value back to determine which percentile of ROC corresponds to the ROC at ramps (based on the ramp:conductor-rail length ratio)

Note that the NB is primarily made up of floating LVDT readings, while the SB readings are majority 3R conductor rail.

We build on the previous assumption that the 99.5th percentile for the SB is a good cut-off point to determine which ranges of 3R belong to ramps. Below this value, we assume the LVDT is in the float range or contact range.

Immediately, it is noticeable that most of the vertical gauge pairs with high ROC do correspond with the expected vertical gauge ranges of the ramps. However, there are also borderline cases where the range is within the expected vertical gauge of the contact range. When we plot the ROC against the vertical gauge ranges, the reason becomes clear: the noise values are large enough to create large ROC values that essentially merge into the actual ROC values expected at ramps. This means that we need to process the data even further, or find a feature that can distinguish the ROC measurements at the ramp from the ROC measurements caused by noise.

Under the assumption that noise is random, there should not be a sustained range of records with high ROC due to noise. Perhaps a moving average will help to check for sustained rate of change characteristics of ramps.

Using moving average of rate of change to isolate ramp chainages

Finally, this definitely looks like it has successfully captured the 3R ramp chainages. The vertical gauge can be seen climbing from a value within the 3R contact range all the way into the floating range before finding the next ramp. If the moving average (MA) of the ROC is a good way to filter out the ramps, it should be possible to find a clear boundary between the MAROC (moving average rate of change) values of noise in the floating range and the MAROC of noise plus minor gauge movements in the contact range. A chart of the MAROC against the vertical gauge will be useful for determining whether we have succeeded - and also for determining the threshold to set the MAROC at.
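The MAROC filter can be sketched as a centered rolling mean over the per-row ROC; the window size is an illustrative assumption:

```python
import pandas as pd

def add_maroc(df, window=15):
    df = df.copy()
    # a centered rolling mean dilutes isolated noise spikes, while the
    # sustained gradient of a real ramp keeps the average high
    df["MAROC"] = (df["ROC"].rolling(window, center=True, min_periods=1)
                            .mean())
    return df
```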

From the 'MAROC v 3R Left Vert' charts and the 'ROC v 3R Left Vert' charts, it is clear that the 3R vertical gauge range from 200-210 belongs strictly to the ramp (no small ROC/MAROC values characteristic of noise/actual minor gauge fluctuations). There is an option of conducting a search spread starting from values in this range, outwards, until the ROC/MAROC hits a value that is too low, which indicates the ramp range has terminated. Alternatively, we can try increasing the MAROC window to identify sections of 3R that have a long sustained MAROC and simply label the range as ramps if it contains vertical gauge in the 200-210 range (note that this value may differ from dataset to dataset, but can be easily identified as the first vertical gauge value that has no ROC/MAROC values in the 0-1 range).

Unfortunately, it seems the second option of simply finding a window that can split the three clusters is not feasible, hence it will be necessary to consider the vertical gauge as well. One way could be to get the moving average of the vertical gauge and move it in tandem with the MAROC to get both the average vertical gauge of a given window (expected to be higher than the contact range average and lower than the floating range average) and its minimum ROC.

Note: It is not necessary to match the windows since we are just using these values to characterize the vertical gauge at that point

Again, this is definitely bringing us one step closer to truly isolating the ramp measurements from the float and contact range measurements. By taking the moving average with a forward window, readings above the y=x line refer to measurements heading into a stretch of increasing vertical gauge. The trails of increasingly large vertical gauge measurements that depart from the contact range and enter the floating range from above the y=x line are instances where the sensor is moving out of a ramp towards the floating range, while the trails that lead from the floating range back into the contact range (clockwise) are instances of the sensor moving towards a region of lower vertical gauge. Interestingly, perhaps due to the controlled amount of data, it is almost possible to follow the trails by eye.

Regardless, now that there are several features for us to identify the ramp gauge measurements, it is time to test out different methods of labeling the points and determining what is the best way to label them in the future.

Note that these vertical gauges are a forward window that averages the vertical gauge of 15 rows after the current point.

This labeling seems to miss some ramp cases, possibly because the MA15 is outside the required range; it might be better to use the absolute difference between the MA15/MMED and the actual measurement to determine if a point belongs to a ramp.

Note that MA15 looks forward by 15 rows to get the average of the sensor readings in those rows
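pandas' rolling windows look backward by default, so one way to sketch the forward-looking MA15 is to roll over the reversed series; the helper below averages the `window` rows after each point, excluding the point itself:

```python
import pandas as pd

def forward_ma(s, window=15):
    # rolling on the reversed series makes the window cover the rows ahead
    # (inclusive of the current row); shifting by -1 then drops the current
    # row, leaving the mean of the next `window` readings
    inc = s[::-1].rolling(window, min_periods=1).mean()[::-1]
    return inc.shift(-1)
```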

This is much better; let us take a look at how the labeling looks on other plots.

The data labeling seems to be missing the start and end of the ramp readings. Maybe the MMED15 filter is removing some of the points because the threshold is too high. Additionally, now that it is clear that the 3R_Left_Vert - Left Verti Gauge MA15 plot is a good way to visualize which points belong to ramps and which do not, we will use it to determine the success of the labeling.

Reducing the threshold does pick up more of the ramp labels, but also appears to mislabel some entries in the contact range. Let us try the MA filter with the same threshold.

The MA filter does a much better job than the MMED filter, possibly because the MMED, when containing a skewed window of data, will output a value heavily in favor of the skew instead of assuming a balance like the MA (e.g. 1,1,1,2,2,3,3,6,8,9,10 will give a median of 3 when the average is >4). Let us see if some noise can be removed by using a centered MA to approximate the measurements.
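The skewed-window example above can be checked directly:

```python
import statistics

# on a skewed window, the median sits with the bulk of the values while the
# mean is pulled toward the tail - which is why the MA filter reacts earlier
window = [1, 1, 1, 2, 2, 3, 3, 6, 8, 9, 10]
med = statistics.median(window)  # 3
avg = statistics.mean(window)    # 46/11, just over 4
```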

Some mislabeled points were indeed corrected. One last modification which may improve the results is constructing the ROC values from an MA of the actual measurements (to reduce the fluctuations in ROC due to noise).

In total, on the NSL, there are 2 bounds, 27 stations and 381 ramps at 5.1m each, which approximates to around 1943m worth of ramps. Although the SB is the majority 3R side, a similar run was conducted on the NB Left and the length of ramps was found to be similar, implying that the total distance of ramps is approximately the same on the majority and minority sides.

This distance is roughly 390m for each of the left and right sides, hence by multiplying by 4 to account for the left and right sides on both bounds, we arrive at an approximate total of 1560m worth of ramps. This means that as we make the final tuning for identifying ramp measurements, we can afford to relax the thresholds a little to factor in measurements that may be on the borderline of being classified as ramp measurements.

Note: For sides with less 3R, the stretches of 3R are shorter, meaning that proportionally, the ratio of ramps to 3R is much higher.

HSR%20DRAWING.JPG

Ramp Documentation: Each ramp is roughly 5m long with a slope of 1:50 (a gain of 1mm every 50mm)

Since each reading is approximately 10cm = 100mm apart, the gain between consecutive readings on the ramp is expected to be 2mm. With a forward window of size 3, this means that at the reading right before entering the ramp range, the gains in the forward window will be [0,2,2] (inclusive of the reading right before the ramp), so the median of the window will be 2 and the mean will be 4/3.

If our ROC is defined as two rows ahead in this exercise, the expected gain between non-ramp and ramp values is 4mm, hence we expect that at the ramps, the ROC will be approximately 4.
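A quick check of the ramp-geometry arithmetic above:

```python
# 1:50 slope over a ~100 mm sample spacing gives 2 mm per reading,
# so a ROC defined two rows ahead sees a gain of ~4 mm at the ramp
spacing_mm = 100            # ~10 cm between LVDT samples
slope = 1 / 50              # ramp gradient (1 mm rise per 50 mm run)
gain_per_reading = spacing_mm * slope  # 2.0 mm
roc_two_rows = 2 * gain_per_reading    # 4.0 mm
```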

Summary of using average rate of change to isolate ramp measurements:

  1. Use centered MA to approximate point measurements (with a centered window of 3-5)
  2. Use the approximated point measurements to generate the ROC with a step of 2-3
  3. Generate the forward window MA for the approximated points (use a larger window of 10-15)
  4. Generate the MAROC for that point using a MA over the ROC (use a centered window of around 10-20)
  5. Label the points using the:

Additional Ideas: It may be possible to set a looser threshold for the experimented filters above and then add an additional filter made up of a centered MA with a large window of around 15 and filtering by having the average of this large centered MA being above a certain gauge measurement (e.g.165mm). This may be able to remove the noise observed in the contact range cluster.

Implementing the ramp, floating and contact measurement labeling pipeline

First we need to finalize the filter thresholds and methods once and for all. We take one of the more ramp-biased settings and attempt the minimum centered MA measurement threshold filter.

Seems a little too aggressive, so let us tone it down a little.

Get the statistical equivalents of 165 and 225.

Close enough. With this the thresholds can adjust dynamically according to the data (except for the 3mm difference, but this value is related to the geometry of the ramp and not the data specifically). We will apply this feature engineering to the entire data set.
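A sketch of making the 165/225 thresholds dynamic by expressing them as quantiles of the gauge distribution; the specific quantile values below are illustrative assumptions, not the notebook's computed ones:

```python
import pandas as pd

def dynamic_thresholds(gauge, lo_q=0.10, hi_q=0.90):
    # express the hand-picked mm cut-offs as quantiles so they track the
    # data; the 3 mm ramp-geometry difference stays fixed separately
    return gauge.quantile(lo_q), gauge.quantile(hi_q)
```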

Final Ramp Labeling Pipeline

  1. Use centered MA to approximate point measurements (with a centered window of 3-5)
  2. Use the approximated point measurements to generate the ROC with a step of 2-3
  3. Generate the forward window MA for the approximated points (use a larger window of 10-15)
  4. Generate the MAROC for that point using a MA over the ROC (use a centered window of around 10-20)
  5. Generate large centered MA around point to determine if point lies in definite contact or floating range (use centered window of around 15-25)
  6. Label the points using the:
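The pipeline above might be sketched as follows; all window sizes and thresholds are illustrative placeholders rather than the notebook's tuned values, and 'Chainage1'/'3R_Left_Vert' are the assumed column names:

```python
import pandas as pd

def label_ramps(df, roc_thresh=1.5, maroc_thresh=1.0):
    df = df.sort_values("Chainage1").copy()
    # 1. centred MA to denoise the point measurements
    df["smooth"] = df["3R_Left_Vert"].rolling(3, center=True, min_periods=1).mean()
    # 2. ROC of the smoothed points with a step of 2 rows
    df["ROC"] = df["smooth"].diff(2).abs() / 2
    # 3. forward-window MA of the smoothed points
    df["fwd_ma"] = df["smooth"][::-1].rolling(15, min_periods=1).mean()[::-1]
    # 4. centred MA over the ROC (MAROC)
    df["MAROC"] = df["ROC"].rolling(15, center=True, min_periods=1).mean()
    # 5. large centred MA to flag definite contact/floating stretches
    df["wide_ma"] = df["3R_Left_Vert"].rolling(21, center=True, min_periods=1).mean()
    # 6. combine: sustained gradient (MAROC) plus a large local gradient (ROC)
    df["label"] = "other"
    ramp = (df["MAROC"] > maroc_thresh) & (df["ROC"] > roc_thresh)
    df.loc[ramp, "label"] = "ramp"
    return df
```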

Any remaining corrections will be done while traversing the data down the chainage (e.g. a chain of ramp labels with a random contact label in between means the contact label is wrong; an entire ramp chain below the ramp mean means the ramp labels are wrong and it is a contact range with bad gauge).

The next step will be to analyze each of these distributions in isolation and normalize them against a benchmark so that the different distributions can be better compared against one another.

Normalizing distributions of different LVDTs/batches of data

Preparation for cross comparisons of data across different sensors/batches of data

First, we apply the labeling for the entire dataset.

Label the entire dataset (and save as .pkl)

Note that due to the chainage jump, there are duplicate chainages (exactly 2 instances) at specific ranges. This will be fixed by breaking the lengths into a 'before jump' and 'after jump' section before comparing vertical gauges of specific chainages and labels between different dates.
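One way to sketch the jump split: detect the first oversized step in the chainage sequence and cut there (the 1000-unit threshold and the 'pre'/'post' tags are assumptions):

```python
import pandas as pd

def split_at_jump(df, col="Chainage1", thresh=1000):
    df = df.reset_index(drop=True)
    # a chainage jump shows up as one very large step between rows
    steps = df[col].diff().abs()
    jumps = steps[steps > thresh].index
    if len(jumps) == 0:
        return df.assign(jump="pre"), df.iloc[0:0].assign(jump="post")
    cut = jumps[0]
    return df.iloc[:cut].assign(jump="pre"), df.iloc[cut:].assign(jump="post")
```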

Also note that there is trash at around Chainage1 == 0.

chainage%20jump%20point.jpg

chainage%20jump%20sector.jpeg

Clearly there is some kind of chainage marking problem with the 'Chainage2' location marker.

Checking the labeled data, it seems that some chainages are covered more than others (chainage duplicates? or some locations measured more than others?)

Everything seems to be in working order, and the labeling algorithm captures the different distributions well.

Get global averages to reverse the normalization of the individual subdfs

It is actually not critical that the values really represent the exact statistical parameters of the data since we are just using these values to generate a normal distribution for the contact range measurements and a uniform-ish distribution for the ramp range.

Based on the above, there are now approximate values we can use to reverse the normalized data. So we may proceed to normalize each dataset collected by each operation.

The statistical assumption here is that:

  1. The samples come from the same population (by conducting the normalization at the sensor level, this is checked)
  2. Comparisons being made across different populations will only be compared against populations in their own category (e.g. normalized ramp data against normalized ramp data)

With the labeled data, we are able to fulfill the requirements above.
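A sketch of the per-sensor normalization via a grouped z-score, so runs from different sensors and dates can be pooled; the grouping columns are assumptions:

```python
import pandas as pd

def normalize_per_sensor(df, value_col="3R_Left_Vert",
                         group_cols=("EMU", "Bound", "Date")):
    # standardize each sensor/run to zero mean and unit variance so that
    # only like-for-like (e.g. ramp vs ramp) comparisons are made later
    g = df.groupby(list(group_cols))[value_col]
    df = df.copy()
    df["z"] = (df[value_col] - g.transform("mean")) / g.transform("std")
    return df
```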

Ramp data is roughly uniform, as expected.

Left and right data follows the normal distribution beautifully (with a slightly longer right rail). With the processed data coming out as expected, we can proceed to the final step which is comparing the data between dates.

Alternative method of standardizing data (translation of peak)

This assumes that the difference in distribution peaks are due to a fixed displacement of each sensor due to differences in the set-ups - and the method corrects for this difference in displacement.

Standardizing the distributions about a fixed value is as simple as translating the entire dataset by 170 - mean.

The distribution is almost identical to the method that involves converting each distribution to a standard normal distribution, and it retains the original standard deviation of all right-sensor readings (4.0915) and left-sensor readings (4.211). In the first method, the data from each sensor is standardized before being combined - essentially going through more steps to center the mean to 0 and the standard deviation to 1, followed by reconverting the standardized values back into equivalent actual values. From a procedural efficiency point of view, translation definitely surpasses standardizing each distribution and converting it back to its non-standardized form after recombining the distributions. From the perspective of the resultant distribution, the normal distribution parameters are expectedly similar. A key point to note is that translation does not standardize the standard deviation of each distribution; the implications of this would be significant if the distributions had large, dissimilar standard deviations (potentially warping the bell-shaped normal distribution), but in this instance the standard deviations of the sensors are roughly similar (3.8-4.3), allowing the translation procedure to work well.
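The translation method can be sketched as a grouped mean shift onto the 170mm nominal; column names are assumptions:

```python
import pandas as pd

def translate_to_nominal(df, value_col="3R_Left_Vert",
                         group_cols=("EMU",), nominal=170.0):
    g = df.groupby(list(group_cols))[value_col]
    df = df.copy()
    # translation moves each sensor's mean onto the nominal value while
    # preserving that sensor's spread (unlike full standardization)
    df["translated"] = df[value_col] + (nominal - g.transform("mean"))
    return df
```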

Comparing measurements between dates (rate of change over time)

Data from different trains will be considered as following the same distribution based on how we have processed the data so far.

The main comparisons will be done across time with the bound and the side fixed.

Fixing the Chainage Jump

Due to the chainage jump, chainages will first need to be split

Case above shows for a given date, train, bound, the existence of two locations with the same chainage.

Note that because the Chainage2 decimal points have been removed, there are multiple identical chainages, which slightly messes up sorting by Chainage2. To fix this, after isolating the post-jump from the pre-jump chainage, a sort needs to be done on both the pre-jump and post-jump chainage sets. This is only necessary if entries are being analyzed by index.

Everything seems to be in order after the jump splits.

Technically we are ready to compare the measurements between dates at this point, however it is very likely that there is a chainage misalignment between different runs. This hypothesis was tested in the Annex ('Testing the LVDT chainage mapping').

Time series analysis on LVDT data

Framing:

  1. Split by BOUND - 2 DFs Total
  2. Split by JUMP (Pre/Post) - 4 DFs Total

The resulting dataframe has a single CHAINAGE and DATE column with multiple different DATE instances for a given CHAINAGE.

Data Insertion:

  1. Split by BOUND - 2 DFs Total
  2. Split by JUMP (Pre/Post) - 4 DFs Total

The resulting dataframe has a single CHAINAGE and DATE column with multiple different DATE instances for a given CHAINAGE.

It is apparent that the sub-meter resolution of the chainage mapping is definitely off; attempt a translation to reduce the error. This will only work if a translation error is the problem (e.g. a slightly different starting point).
Also, notice that the date2 df is quite a few readings short of the date1 df.

The statistics show that it is not a translation error; the mean difference would be larger if it were. This means that the error is distributed randomly across the comparison locations. It also shows that the min can start from the 160 region, which is definitely too early.

If it was a translation error, we would expect to see gauge differences mostly of the same sign (positive or negative). There may be no choice but to write an algorithm that centers each set of ramp readings around the highest points of the readings.

The algorithm will:

1) Check for a sufficiently long series of ramp readings for the ramp to be considered a ramp (relabel ramp label otherwise)
2) Label the ramps in the data from Ramp1 - RampN
3) Match ramps that exist in both datasets (based on chainage) against each other
4) Use a window method to minimize error between the two arrays of ramp measurements
5) Return a dataframe with each ramp number, the average chainage of the ramp number, the average vertical gauge readings and the difference between average vertical gauge readings 
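Steps 1-2 of the algorithm can be sketched as run-length filtering and numbering of consecutive ramp labels; the 'ramp'/'contact' label values and the minimum run length are assumptions:

```python
import pandas as pd

def number_ramps(df, min_len=5):
    df = df.copy()
    is_ramp = df["label"].eq("ramp")
    # assign an id to each run of identical labels
    run_id = (is_ramp != is_ramp.shift()).cumsum()
    run_len = is_ramp.groupby(run_id).transform("size")
    # step 1: relabel ramp runs that are too short to be real ramps
    df.loc[is_ramp & (run_len < min_len), "label"] = "contact"
    # step 2: number the surviving ramp runs Ramp1..RampN
    is_ramp = df["label"].eq("ramp")
    ramp_run = (is_ramp != is_ramp.shift()).cumsum()[is_ramp]
    df.loc[is_ramp, "ramp_id"] = ramp_run.rank(method="dense")
    return df
```

Steps 3-5 (pairing ramps between runs by average chainage and minimizing window error) would then operate on the `ramp_id` groups this produces.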

Note that because each run has been individually converted to a standard normal distribution, the gauge difference reflects a relative change in state between the two runs. For example, if all ramps wear the same amount, the gauge difference is 0, BUT if one ramp wears more than the rest, that ramp shows up as wear while the other ramps appear to 'grow'. This is because all of the measurements were fit to a normal distribution. Using this method enables us to find the ramps which experience the most wear by analyzing the Gauge_Difference, while identifying the most-worn ramps using the average 3R_Left_Vert of the ramp. IT DOES NOT give us the actual wear, nor the actual vertical gauge at the ramp (which is used to proxy ramp wear).

Check for a sufficiently long series of ramp readings for the ramp to be considered a ramp

1) *Check for a sufficiently long series of ramp readings for the ramp to be considered a ramp*
2) Label the ramps in the data from Ramp1 - RampN
3) Match ramps that exist in both datasets (based on chainage) against each other
4) Use a window method to minimize error between the two arrays of ramp measurements
5) Return a dataframe with each ramp number, the average chainage of the ramp number, the average vertical gauge readings and the difference between average vertical gauge readings 

Label the ramps in the data from Ramp1 - RampN

Decided to label ramps by their average chainage to remove some work from steps 2-5

1) Check for a sufficiently long series of ramp readings for the ramp to be considered a ramp (relabel ramp label otherwise)
2) *Label the ramps in the data from Ramp1 - RampN*
3) Match ramps that exist in both datasets (based on chainage) against each other
4) Use a window method to minimize error between the two arrays of ramp measurements
5) Return a dataframe with each ramp number, the average chainage of the ramp number, the average vertical gauge readings and the difference between average vertical gauge readings 

Match ramps that exist in two datasets (based on chainage) against each other

Use a window method to minimize error between the two arrays of ramp measurements

Return a dataframe with each ramp number, the average chainage of the ramp number, the average vertical gauge readings and the difference between average vertical gauge readings

1) Check for a sufficiently long series of ramp readings for the ramp to be considered a ramp
2) Label the ramps in the data from Ramp1 - RampN
3) *Match ramps that exist in both datasets (based on chainage) against each other*
4) *Use a window method to minimize error between the two arrays of ramp measurements*
5) *Return a dataframe with each ramp number, the average chainage of the ramp number, the average vertical gauge readings and the difference between average vertical gauge readings*

With the above, the ramps with (1) the fastest wear rate and (2) the most worn state can be identified between runs by using the difference between ramp gauge values of different runs and the average ramp gauge values of each run. Noting the moving average of the ranks of the ramps across operations will identify the ramps with the highest wear rate and the most worn state, thereby identifying ramps that require monitoring/intervention.

Generating Panel Data (Multiple Ramps Across Time)

To do so, we will need to cycle the code above (except for the matching, which is used to find differences between measurements across different dates) for different bounds(2 subdfs), left/right (4 subdfs) and jumps (8 subdfs) separately and do so for all dates
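The cycling over sub-dataframes can be sketched as a grouped loop; the 'Bound', 'Side' and 'Jump' column names and the `process` callback are placeholders for the notebook's own steps:

```python
import pandas as pd

def build_panels(df, process):
    # run the per-subset pipeline once per bound/side/jump combination,
    # across all dates contained in each subset
    panels = {}
    for (bound, side, jump), sub in df.groupby(["Bound", "Side", "Jump"]):
        panels[(bound, side, jump)] = process(sub)
    return panels
```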

ANNEX

Isolating ramp measurement (additional experiments)

Relaxing the MMED requirement caused some points in the contact range to be labeled as ramps. Maybe using just the difference between the MA and the actual measurements is the best way to go.

This potentially works. Perhaps if the actual measurement is represented by a centered moving average and the forward points are represented by a forward moving average, it will clean things up.

Representing the points as their moving average removed some labels in the contact range that were most likely not ramp points. Let us see if using the ROC MMED to remove points below ROC MMED 1 can filter out the remaining mislabeled points.

Applying the MMED filter removes labels on ramp points that are leaving the ramps and entering the floating range or contact range. It seems the MMED as an absolute filtering method is not useful: even though it seemed able to identify the ramp MMED values, too many of the MMED values in the contact and float ranges are similar to the ramp MMED values.

This method catches the entries into each region well, with a few errors, but not so much the exits from each region, which is surprising because the absolute difference between the MA and the actual measurement should make the two cases indistinguishable. The problem is likely the MMED, which, when containing a skewed window of data, will output a value heavily in favor of the skew instead of assuming a balance like the MA (e.g. 1,1,1,2,2,3,3,6,8,9,10 will give a median of 3 - a close analogy of the current cases of leaving the clusters, where the earlier readings have low ROC because they are still in the cluster).

Did not work. Could it be the forward bias of the MA and MMED? They only look forward, so if the ROC of the values ahead is small, the MA and MMED begin factoring this in early and do not label the points as a ramp. In that case, centering the windows for the MA ROC and MMED ROC would help.

Overlaying data across time (additional experiments)

Interestingly, there seem to be some hardcoded chainage jumps in the Chainage2 data that leave the readings at certain locations empty.

Note that this chainage jump gives the impression that the data is well aligned, when in fact it just means the sensor does not index readings under specific ranges of chainage.
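Such jumps can be located by differencing the chainage index; the values and column layout here are hypothetical, standing in for the Chainage2 data:

```python
import pandas as pd

# Hypothetical chainage index: readings jump from 1000 to 2000 with no
# physical locations in between, so nothing is ever indexed there.
chainage = pd.Series([998, 999, 1000, 2000, 2001, 2002])

# Any step larger than the nominal 1 m spacing marks a hardcoded jump.
steps = chainage.diff()
jump_starts = chainage[steps > 1]  # first chainage value after each jump
```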

Testing the LVDT chainage mapping

TO DO

Use anomaly detection and classify anomalies into different sections of 3R based on risk?

Noticeable problems

Pivot to classifying 3R sections into categories of riskiness instead of predicting time to failure (proven impossible with the current data).

The logic is that the verified 3R low points can be labeled as risky sections of 3R. There are variables, aside from the 3R gauge, associated with the low points that can be attributed to the increased risk. By running a regression across all variables, if there are variables strongly correlated with the 3R low points, we can use that correlation to determine the risk levels at other sections of 3R.

Outstanding Steps:

  1. Determine which data collected is useful for determining 3R low points
  2. Append the useful data collected to the 3R_LOW_POINTS file to create a dataset of low points
  3. Run regressions/one-class anomaly detectors on it to determine which variables are relevant (e.g. sklearn.linear_model.SGDOneClassSVM, sklearn.neighbors.LocalOutlierFactor, sklearn.ensemble.IsolationForest)
  4. Apply regression model to the network and identify which locations are identified as risky
  5. Verify model performance on site

Encoded ZPIA labeled for all entries

Successfully mapped ZPIA, CT and MFV positives to the low-points

Unsupervised Learning for Anomaly Detection

The original workflow of simply training a supervised machine learning model on validated cases of 3R sag has proven unfeasible. Having isolated and labeled the validated cases, the new workflow is to detect anomalies and group them, then identify the group that most closely resembles the validated cases of 3R sag; this group will be categorized as 'high risk' sections of 3R. The other anomalous clusters of 3R sections will be studied to determine whether any insights can be drawn from them.

The first loading primarily captures variance in the 3R vertical measurements, which have a positive covariance with twist and a negative covariance with gauge.

The second loading primarily captures the variance in the gauge, which again has a negative covariance with 3R vertical measurements and a positive covariance with twist.

The third loading primarily captures the covariance between gauge and twist which is negative, and also captures the covariance between CT and 3R vertical measurements, which is negative.
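A sketch of how such loadings can be tabulated with sklearn, using synthetic data and hypothetical column names in place of the actual measurements; the sign and magnitude of each entry give the covariance relationships described above:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical standardized track measurements; names follow the text.
cols = ["vertical_3r", "gauge", "twist", "ct"]
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=cols)

pca = PCA(n_components=3)
pca.fit(StandardScaler().fit_transform(X))

# Loadings table: one row per variable, one column per principal component.
loadings = pd.DataFrame(pca.components_.T, index=cols,
                        columns=["PC1", "PC2", "PC3"])
```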

The relationships generally concur with the engineering knowledge for this asset. When the gauge is positive, it is known as wide gauge: the train sinks lower as the tracks are further apart, hence the negative correlation. For twist, it is less apparent that there is a correlation with the 3R vertical measurements, as the actual effect of twist depends on whether the 3R is on the left or right side of the tracks.

For cable terminals, we expect a negative correlation, as these terminals pull down the 3R when present; this relationship is clearly captured in the loadings of PC2 and PC3.

However, in terms of the conclusions drawn from this exercise, it may just mean that the 3R low points at these locations are explainable and hence should not be considered 'risky', since the causes are known and the effects are expected. It does not necessarily mean that these locations are the ones where maintenance efforts should be prioritized due to risk.

In conclusion, we will focus on the non-anomalous data with low 3R vertical gauge and consider them as the high risk sections of 3R. This corresponds to the green points with low x_composite_1 values.

Since each reading covers 10 cm, this equates to 900 m of high-risk 3R sections. The next stage would be to verify whether these locations are indeed sections of high-risk 3R (3R that is in the midst of undergoing failure?).

Next attempts:

Split the data into 3 categories based on 3R vertical gauge, then run the ML/regressions to determine correlations between each group and the regressors/variables.

Use regression to confirm that there was no decreasing trend over 6 months

Test whether the variance changes over time: split the 6 months into 3 periods; the hypothesis is that the variation in readings should increase as the 3R bolts loosen.
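Both checks can be sketched as follows, on a synthetic trend-free series standing in for the real 6-month gauge data; the daily sampling and noise scale are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical daily 3R vertical gauge series over ~6 months with no trend.
days = np.arange(180)
gauge = 160.0 + rng.normal(scale=0.5, size=days.size)

# Trend check: regress gauge on time; a slope near zero with a large
# p-value is consistent with "no decreasing trend over 6 months".
result = stats.linregress(days, gauge)

# Variance check: split into 3 equal periods; loosening bolts would show
# the spread increasing from one period to the next.
period_vars = [chunk.var(ddof=1) for chunk in np.split(gauge, 3)]
```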

NB Prejump Low Points Regressions

NB Postjump Low Point Regressions

SB Postjump Low Point Regressions

SB Prejump Low Point Regressions